Lecture 23
Original data from Antonio, Almeida, and Nunes (2019), Data dictionary
Rows: 50,000
Columns: 23
$ hotel <fct> City_Hotel, City_Hotel, Resort_Hotel, Resort_Hotel, Re…
$ lead_time <dbl> 217, 2, 95, 143, 136, 67, 47, 56, 80, 6, 130, 27, 16, …
$ stays_in_weekend_nights <dbl> 1, 0, 2, 2, 1, 2, 0, 0, 0, 2, 1, 0, 1, 0, 1, 1, 1, 4, …
$ stays_in_week_nights <dbl> 3, 1, 5, 6, 4, 2, 2, 3, 4, 2, 2, 1, 2, 2, 1, 1, 2, 7, …
$ adults <dbl> 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, …
$ children <fct> none, none, none, none, none, none, children, children…
$ meal <fct> BB, BB, BB, HB, HB, SC, BB, BB, BB, BB, BB, BB, BB, BB…
$ country <fct> DEU, PRT, GBR, ROU, PRT, GBR, ESP, ESP, FRA, FRA, FRA,…
$ market_segment <fct> Offline_TA/TO, Direct, Online_TA, Online_TA, Direct, O…
$ distribution_channel <fct> TA/TO, Direct, TA/TO, TA/TO, Direct, TA/TO, Direct, TA…
$ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ reserved_room_type <fct> A, D, A, A, F, A, C, B, D, A, A, D, A, D, A, A, D, A, …
$ assigned_room_type <fct> A, K, A, A, F, A, C, A, D, A, D, D, A, D, A, A, D, A, …
$ booking_changes <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, …
$ deposit_type <fct> No_Deposit, No_Deposit, No_Deposit, No_Deposit, No_Dep…
$ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 0, 0, 0…
$ customer_type <fct> Transient-Party, Transient, Transient, Transient, Tran…
$ average_daily_rate <dbl> 80.75, 170.00, 8.00, 81.00, 157.60, 49.09, 289.00, 82.…
$ required_car_parking_spaces <fct> none, none, none, none, none, none, none, none, none, …
$ total_of_special_requests <dbl> 1, 3, 2, 1, 4, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, …
$ arrival_date <date> 2016-09-01, 2017-08-25, 2016-11-19, 2016-04-26, 2016-…
Our goal is to develop a model that predicts whether a booking will include children based on the booking's other characteristics.
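The 37,500 / 12,500 row counts and the matching `children` proportions below are consistent with a 75/25 split stratified on the outcome. A sketch of how such a split is typically made with rsample (the object names and the seed are assumptions, not from the original code):

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, assumed for reproducibility

# 75/25 split of the 50,000 bookings, stratified on the outcome so that
# the proportion of bookings with children is similar in train and test
hotel_split = initial_split(hotel, prop = 0.75, strata = children)
hotel_train = training(hotel_split)  # 37,500 rows
hotel_test  = testing(hotel_split)   # 12,500 rows
```

Stratifying on `children` is what keeps the `children` proportion at roughly 8% in both sets, as the two tibbles below show.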
# A tibble: 2 × 3
children n prop
<fct> <int> <dbl>
1 children 3027 0.0807
2 none 34473 0.919
# A tibble: 2 × 3
children n prop
<fct> <int> <dbl>
1 children 1011 0.0809
2 none 11489 0.919
holidays = c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
"ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")
lr_recipe = recipe(children ~ ., data = hotel_train) |>
step_date(arrival_date) |>
step_holiday(arrival_date, holidays = holidays) |>
step_rm(arrival_date) |>
step_rm(country) |>
step_dummy(all_nominal_predictors()) |>
step_zv(all_predictors())
lr_recipe
# A tibble: 37,500 × 76
lead_time stays_in_weekend_nights stays_in_week_nights adults is_repeated_guest
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 0 1 2 0
2 95 2 5 2 0
3 67 2 2 2 0
4 47 0 2 2 0
5 56 0 3 0 0
6 6 2 2 2 0
7 130 1 2 2 0
8 27 0 1 1 0
9 46 0 2 2 0
10 423 1 1 2 0
# ℹ 37,490 more rows
# ℹ 71 more variables: previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
# booking_changes <dbl>, days_in_waiting_list <dbl>, average_daily_rate <dbl>,
# total_of_special_requests <dbl>, children <fct>, arrival_date_year <int>,
# arrival_date_AllSouls <int>, arrival_date_AshWednesday <int>, arrival_date_ChristmasEve <int>,
# arrival_date_Easter <int>, arrival_date_ChristmasDay <int>, arrival_date_GoodFriday <int>,
# arrival_date_NewYearsDay <int>, arrival_date_PalmSunday <int>, hotel_Resort_Hotel <dbl>, …
══ Workflow ════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────
6 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_rm()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
══ Workflow [trained] ══════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────
6 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_rm()
• step_dummy()
• step_zv()
── Model ───────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept)
-2.543e+02
lead_time
-1.287e-03
stays_in_weekend_nights
5.231e-02
stays_in_week_nights
-3.433e-02
adults
7.328e-01
is_repeated_guest
3.962e-01
previous_cancellations
2.147e-01
previous_bookings_not_canceled
3.728e-01
booking_changes
-2.396e-01
days_in_waiting_list
6.415e-03
average_daily_rate
-1.049e-02
total_of_special_requests
-4.936e-01
arrival_date_year
1.344e-01
arrival_date_AllSouls
1.006e+00
arrival_date_AshWednesday
2.019e-01
arrival_date_ChristmasEve
5.328e-01
arrival_date_Easter
-9.749e-01
arrival_date_ChristmasDay
-6.875e-01
arrival_date_GoodFriday
-1.593e-01
arrival_date_NewYearsDay
-1.185e+00
arrival_date_PalmSunday
-6.243e-01
hotel_Resort_Hotel
9.581e-01
meal_FB
-6.348e-01
...
and 110 more lines.
( lr_train_perf = lr_fit |>
augment(new_data = hotel_train) |>
select(children, starts_with(".pred")) )
# A tibble: 37,500 × 4
children .pred_class .pred_children .pred_none
<fct> <fct> <dbl> <dbl>
1 none none 0.0861 0.914
2 none none 0.0178 0.982
3 none none 0.0101 0.990
4 children children 0.931 0.0693
5 children none 0.473 0.527
6 children none 0.144 0.856
7 none none 0.0710 0.929
8 none none 0.0596 0.940
9 none none 0.0252 0.975
10 none none 0.0735 0.926
# ℹ 37,490 more rows
( lr_test_perf = lr_fit |>
augment(new_data = hotel_test) |>
select(children, starts_with(".pred")) )
# A tibble: 12,500 × 4
children .pred_class .pred_children .pred_none
<fct> <fct> <dbl> <dbl>
1 none none 0.00854 0.991
2 none none 0.0202 0.980
3 none children 0.757 0.243
4 none none 0.0373 0.963
5 none none 0.000975 0.999
6 none none 0.000474 1.00
7 none none 0.0736 0.926
8 none none 0.0748 0.925
9 none none 0.0532 0.947
10 none none 0.0794 0.921
# ℹ 12,490 more rows
Truth
Prediction children none
children 1075 420
none 1952 34053
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.937
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.719
Truth
Prediction children none
children 359 137
none 652 11352
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.937
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.724
lr_train_roc = lr_train_perf |>
yardstick::roc_curve(
children, .pred_children
) |>
mutate(name="logistic - train")
lr_test_roc = lr_test_perf |>
yardstick::roc_curve(
children, .pred_children
) |>
mutate(name="logistic - test")
bind_rows(
lr_train_roc,
lr_test_roc
) |>
ggplot(aes(x = 1 - specificity, y = sensitivity, col = name)) +
geom_path(lwd = 1.5, alpha = 0.8) +
geom_abline(lty = 3) +
coord_equal()
For this we will use the glmnet package, which supports fitting lasso, ridge, and elastic net models.
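The trained workflow shown below has `penalty = tune()` and `mixture = 1`, which corresponds to a tunable lasso specification. A sketch of that parsnip spec (the object name is an assumption):

```r
# mixture = 1 gives the pure lasso penalty; mixture = 0 would give ridge,
# and values in between give elastic net. The penalty is left to be tuned.
lasso_spec = logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet") |>
  set_mode("classification")
```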
Collection of 1 parameters for tuning
identifier type object
penalty penalty nparam[+]
Lasso (and ridge) models are sensitive to the scale of the model features, so a standard approach is to normalize all numeric features before fitting the model.
# A tibble: 37,500 × 76
lead_time stays_in_weekend_nights stays_in_week_nights
<dbl> <dbl> <dbl>
1 -0.858 -0.938 -0.767
2 0.160 1.09 1.32
3 -0.146 1.09 -0.245
4 -0.365 -0.938 -0.245
5 -0.267 -0.938 0.278
6 -0.814 1.09 -0.245
7 0.544 0.0735 -0.245
8 -0.584 -0.938 -0.767
9 -0.376 -0.938 -0.245
10 3.75 0.0735 -0.767
# ℹ 37,490 more rows
# ℹ 73 more variables: adults <dbl>,
# is_repeated_guest <dbl>, previous_cancellations <dbl>,
# previous_bookings_not_canceled <dbl>,
# booking_changes <dbl>, days_in_waiting_list <dbl>,
# average_daily_rate <dbl>,
# total_of_special_requests <dbl>, children <fct>, …
══ Workflow ════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────
7 Recipe Steps
• step_date()
• step_holiday()
• step_rm()
• step_rm()
• step_dummy()
• step_zv()
• step_normalize()
── Model ───────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Main Arguments:
penalty = tune()
mixture = 1
Computational engine: glmnet
# 5-fold cross-validation using stratification
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [30000/7500]> Fold1
2 <split [30000/7500]> Fold2
3 <split [30000/7500]> Fold3
4 <split [30000/7500]> Fold4
5 <split [30000/7500]> Fold5
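The folds above can be constructed with rsample's `vfold_cv()`; a sketch, with the seed assumed (each fold holds out 7,500 of the 37,500 training rows):

```r
set.seed(123)  # arbitrary seed, assumed

# 5-fold cross-validation, stratified on the outcome as before
hotel_vf = vfold_cv(hotel_train, v = 5, strata = children)
```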
( lasso_grid = lasso_work |>
tune_grid(
hotel_vf,
grid = tibble(
penalty = 10^seq(-4, -1, length.out = 10)
),
control = control_grid(save_pred = TRUE),
metrics = metric_set(roc_auc)
)
)
# Tuning results
# 5-fold cross-validation using stratification
# A tibble: 5 × 5
splits id .metrics .notes .predictions
<list> <chr> <list> <list> <list>
1 <split [30000/7500]> Fold1 <tibble> <tibble> <tibble>
2 <split [30000/7500]> Fold2 <tibble> <tibble> <tibble>
3 <split [30000/7500]> Fold3 <tibble> <tibble> <tibble>
4 <split [30000/7500]> Fold4 <tibble> <tibble> <tibble>
5 <split [30000/7500]> Fold5 <tibble> <tibble> <tibble>
# A tibble: 10 × 7
penalty .metric .estimator mean n std_err
<dbl> <chr> <chr> <dbl> <int> <dbl>
1 0.0001 roc_auc binary 0.877 5 0.00318
2 0.000215 roc_auc binary 0.877 5 0.00316
3 0.000464 roc_auc binary 0.877 5 0.00314
4 0.001 roc_auc binary 0.877 5 0.00304
5 0.00215 roc_auc binary 0.877 5 0.00263
6 0.00464 roc_auc binary 0.870 5 0.00253
7 0.01 roc_auc binary 0.853 5 0.00249
8 0.0215 roc_auc binary 0.824 5 0.00424
9 0.0464 roc_auc binary 0.797 5 0.00400
10 0.1 roc_auc binary 0.5 5 0
# ℹ 1 more variable: .config <chr>
# A tibble: 10 × 7
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.001 roc_auc binary 0.877 5 0.00304 Preproce…
2 0.00215 roc_auc binary 0.877 5 0.00263 Preproce…
3 0.000464 roc_auc binary 0.877 5 0.00314 Preproce…
4 0.000215 roc_auc binary 0.877 5 0.00316 Preproce…
5 0.0001 roc_auc binary 0.877 5 0.00318 Preproce…
6 0.00464 roc_auc binary 0.870 5 0.00253 Preproce…
7 0.01 roc_auc binary 0.853 5 0.00249 Preproce…
8 0.0215 roc_auc binary 0.824 5 0.00424 Preproce…
9 0.0464 roc_auc binary 0.797 5 0.00400 Preproce…
10 0.1 roc_auc binary 0.5 5 0 Preproce…
Since we used control_grid(save_pred = TRUE) with tune_grid(), we can recover the out-of-sample predictions for each fold:
# A tibble: 37,500 × 7
id .pred_children .pred_none .row penalty children
<chr> <dbl> <dbl> <int> <dbl> <fct>
1 Fold1 0.366 0.634 5 0.00215 children
2 Fold1 0.144 0.856 6 0.00215 children
3 Fold1 0.0542 0.946 19 0.00215 none
4 Fold1 0.0266 0.973 21 0.00215 none
5 Fold1 0.106 0.894 22 0.00215 children
6 Fold1 0.0286 0.971 23 0.00215 none
7 Fold1 0.0205 0.980 30 0.00215 none
8 Fold1 0.0192 0.981 31 0.00215 none
9 Fold1 0.0431 0.957 32 0.00215 none
10 Fold1 0.0532 0.947 35 0.00215 none
# ℹ 37,490 more rows
# ℹ 1 more variable: .config <chr>
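A sketch of how these fold-level predictions are retrieved; the tibble above shows a single penalty value, so the filter step is an assumption about how it was subset:

```r
# collect_predictions() stacks the held-out predictions from every fold
# and every penalty value; filtering keeps one penalty for inspection
lasso_grid |>
  collect_predictions() |>
  filter(penalty == 0.00215)
```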
Typically with a tuned model we refit on the complete training data using the “best” parameter value(s),
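A sketch of that finalize-and-refit step (the `best_penalty` and `lasso_fit` names are assumptions):

```r
# Pick the penalty with the best cross-validated ROC AUC, plug it into
# the workflow, and refit on the full training set
best_penalty = lasso_grid |> select_best(metric = "roc_auc")

lasso_fit = lasso_work |>
  finalize_workflow(best_penalty) |>
  fit(data = hotel_train)
```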
Truth
Prediction children none
children 330 109
none 681 11380
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.937
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.752
# A tibble: 5 × 2
engine mode
<chr> <chr>
1 rpart classification
2 rpart regression
3 C5.0 classification
4 spark classification
5 spark regression
We skip dummy coding in the recipe since rpart handles factor predictors directly.
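The tuning grid below varies cost complexity, tree depth, and minimum node size, which corresponds to a spec like the following; the recipe drops `step_dummy()` as just noted (object names are assumptions):

```r
# Tunable decision tree; rpart consumes factors directly, so no dummy coding
dt_spec = decision_tree(
  cost_complexity = tune(),
  tree_depth      = tune(),
  min_n           = tune()
) |>
  set_engine("rpart") |>
  set_mode("classification")

# Same feature engineering as before, minus step_dummy()
dt_recipe = recipe(children ~ ., data = hotel_train) |>
  step_date(arrival_date) |>
  step_holiday(arrival_date, holidays = holidays) |>
  step_rm(arrival_date, country)
```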
# A tibble: 27 × 3
cost_complexity tree_depth min_n
<dbl> <int> <int>
1 0.0000000001 1 2
2 0.00000316 1 2
3 0.1 1 2
4 0.0000000001 8 2
5 0.00000316 8 2
6 0.1 8 2
7 0.0000000001 15 2
8 0.00000316 15 2
9 0.1 15 2
10 0.0000000001 1 21
# ℹ 17 more rows
# A tibble: 27 × 9
cost_complexity tree_depth min_n .metric .estimator mean
<dbl> <int> <int> <chr> <chr> <dbl>
1 0.0000000001 15 21 roc_auc binary 0.867
2 0.00000316 15 21 roc_auc binary 0.867
3 0.0000000001 15 40 roc_auc binary 0.863
4 0.00000316 15 40 roc_auc binary 0.863
5 0.0000000001 8 21 roc_auc binary 0.848
6 0.00000316 8 21 roc_auc binary 0.848
7 0.0000000001 8 40 roc_auc binary 0.846
8 0.00000316 8 40 roc_auc binary 0.846
9 0.0000000001 8 2 roc_auc binary 0.843
10 0.00000316 8 2 roc_auc binary 0.843
# ℹ 17 more rows
# ℹ 3 more variables: n <int>, std_err <dbl>, .config <chr>
# A tibble: 5 × 9
cost_complexity tree_depth min_n .metric
<dbl> <int> <int> <chr>
1 0.0000000001 15 21 roc_auc
2 0.00000316 15 21 roc_auc
3 0.0000000001 15 40 roc_auc
4 0.00000316 15 40 roc_auc
5 0.0000000001 8 21 roc_auc
# ℹ 5 more variables: .estimator <chr>,
# mean <dbl>, n <int>, std_err <dbl>,
# .config <chr>
# A tibble: 1 × 4
cost_complexity tree_depth min_n .config
<dbl> <int> <int> <chr>
1 0.0000000001 15 21 Preprocessor1_Model16
Truth
Prediction children none
children 444 270
none 567 11219
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.933
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.622
# A tibble: 6 × 2
engine mode
<chr> <chr>
1 ranger classification
2 ranger regression
3 randomForest classification
4 randomForest regression
5 spark classification
6 spark regression
We skip dummy coding in the recipe since ranger also handles factor predictors directly.
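The tuning results below vary `mtry` and `min_n`, consistent with a random forest spec like this sketch (the object name is an assumption):

```r
# Tunable random forest: mtry controls how many predictors are sampled
# at each split, min_n the minimum node size before splitting stops
rf_spec = rand_forest(mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")
```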
# A tibble: 10 × 8
mtry min_n .metric .estimator mean n std_err
<int> <int> <chr> <chr> <dbl> <int> <dbl>
1 5 3 roc_auc binary 0.918 5 0.00195
2 9 31 roc_auc binary 0.916 5 0.00181
3 10 21 roc_auc binary 0.915 5 0.00183
4 15 23 roc_auc binary 0.912 5 0.00177
5 18 38 roc_auc binary 0.911 5 0.00251
6 21 28 roc_auc binary 0.910 5 0.00198
7 24 10 roc_auc binary 0.908 5 0.00289
8 26 35 roc_auc binary 0.907 5 0.00192
9 30 15 roc_auc binary 0.907 5 0.00237
10 2 11 roc_auc binary 0.899 5 0.00259
# ℹ 1 more variable: .config <chr>
# A tibble: 5 × 8
mtry min_n .metric .estimator mean n
<int> <int> <chr> <chr> <dbl> <int>
1 5 3 roc_auc binary 0.918 5
2 9 31 roc_auc binary 0.916 5
3 10 21 roc_auc binary 0.915 5
4 15 23 roc_auc binary 0.912 5
5 18 38 roc_auc binary 0.911 5
# ℹ 2 more variables: std_err <dbl>,
# .config <chr>
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 5 3 Preprocessor1_Model03
Truth
Prediction children none
children 388 70
none 623 11419
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.945
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.847
Sta 523 - Fall 2023